Case Study: AllLife Bank using Logistic Regression | Decision Tree


Context:

AllLife Bank is a US bank that has a growing customer base.

Problem:

Objective:

Explore the dataset and extract insights from the data.

  1. To predict whether a liability customer will buy a personal loan or not.
  2. To identify which variables are most significant.
  3. To determine which segment of customers should be targeted more.

Data Dictionary:


Loading libraries

Loading and exploring the data

Loading the data into Python to explore and understand it.
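A minimal sketch of the loading step. The actual file name is not given in the notebook, so the `read_csv` call is commented out and a tiny illustrative frame with a few columns from the data dictionary stands in for it:

```python
import pandas as pd

# Hypothetical file name -- the real path is not stated in the notebook:
# df = pd.read_csv("Loan_Modelling.csv")

# Illustrative stand-in with a few columns from the data dictionary:
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Age": [25, 45, 39],
    "Experience": [-1, 19, 15],
    "Income": [49, 34, 11],
    "ZIPCode": [91107, 90089, 94720],
    "Personal_Loan": [0, 0, 1],
})
print(df.shape)
print(df.head())
```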

Overview of the data

ID is just an index for each data entry. In all likelihood, this column will not be a significant factor in determining whether a customer buys a personal loan. We will not drop this variable just yet; let us see whether any relationship with the target shows up in the bivariate analysis.

ZIPCode: let us check how many unique values we have. If there are too many, we will need some processing to extract the important information, bin the values, and convert the column to a categorical variable.

Personal Loan is our target variable and the one we need to predict.

Observation:

We have an imbalanced dataset: 90.5% of customers won't buy a personal loan and only 9.5% will.
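The class balance above comes from a normalized value count on the target. A sketch with illustrative counts matching the stated 90.5% / 9.5% split:

```python
import pandas as pd

# Illustrative target series reproducing the stated class balance
y = pd.Series([0] * 905 + [1] * 95, name="Personal_Loan")

# Proportion of each class (0 = won't buy, 1 = will buy)
dist = y.value_counts(normalize=True)
print(dist)
# 0 -> 0.905, 1 -> 0.095
```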

Checking for duplicates in the data.

Check the data types of the columns for the dataset.
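Both checks are one-liners in pandas; a small sketch with synthetic rows (the duplicate row here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 2], "Income": [49, 34, 34]})

# Count fully duplicated rows
print(df.duplicated().sum())

# Data type of each column
print(df.dtypes)
```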

Observations:

Processing Columns

1. ZIP Code

It is a numeric value, but it is essentially a category. We will treat the values by mapping ZIP codes to different locations.

Observation:

There are still a lot of unique values for county; let's try grouping them into North and South.
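One way to sketch that grouping. The county names and the North/South assignment below are illustrative, not the notebook's actual mapping:

```python
import pandas as pd

# Hypothetical set of Northern California counties for illustration
north = {"San Francisco", "Alameda", "Santa Clara", "Sacramento"}

def north_south(county: str) -> str:
    """Map a county name to a coarse North/South label."""
    return "North" if county in north else "South"

counties = pd.Series(["Los Angeles", "Santa Clara", "San Diego"])
print(counties.map(north_south).tolist())  # ['South', 'North', 'South']
```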

Check for missing values

Observations:

Observation:

  1. Customers seem to be evenly distributed between Northern and Southern California, suggesting this is not going to be a significant variable.

Exploratory Data Analysis

Give a statistical summary for the dataset.
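The summary referred to here is the usual `describe()` table; a sketch with illustrative values:

```python
import pandas as pd

# Illustrative data; the real notebook calls describe() on the full dataset
df = pd.DataFrame({"Age": [25, 45, 39, 60], "Income": [49, 34, 11, 180]})

summary = df.describe().T  # transpose so each variable is a row
print(summary[["min", "max", "mean"]])
```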

Age spans a wide range of values. We should check a few of the extreme values to get a sense of the data.

Experience min and max values warrant a quick check; there are also negative values that we need to treat during data cleaning.

Income min is too low for a yearly income, and the max seems too high; we need to check both.

north_south seems to be balanced.

Family and Education are ordinal categorical variables; we will keep them as numeric and proceed with label encoding, since there is a sense of order in the values.

Securities_Account, CD_Account, Online, and CreditCard are binary categorical variables; we will keep them as numeric and proceed with label encoding.

Data Cleaning

1. Experience by Age

We will check the min and max values of Experience by Age.

Observation:

  1. Some customers have negative experience, which does not make sense; we will need to treat these as missing values.
  2. Younger customers have less experience than older customers, which makes sense.

Observation:

  1. 12 customers aged 23 have negative experience; let's replace it with 0.
  2. 17 customers aged 24 have negative experience, and the mean experience is 0; let's replace it with 0.
  3. 18 customers aged 25 have negative experience, and the mean experience is less than 1; let's replace it with 0.
  4. 1 and 3 customers aged 26 and 29, respectively, have negative experience, with mean experience below 1 and below 4; since the counts are not expressive, let's also replace these with 0.
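The replacement decided above is a single masked assignment; a sketch with illustrative rows:

```python
import pandas as pd

# Illustrative rows; two customers have impossible negative experience
df = pd.DataFrame({"Age": [23, 24, 45], "Experience": [-2, -1, 19]})

# Replace negative Experience with 0, as decided above
df.loc[df["Experience"] < 0, "Experience"] = 0
print(df["Experience"].tolist())  # [0, 0, 19]
```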

2. Income

We will check the min and max values of Income by Age and Experience.

Observation:

  1. There are all kinds of customers (by experience / age) with low income.
  2. It could be an error, or customers who did not work the whole year or have a part-time job.
  3. There is no pattern that explains it.

Observation:

  1. It is unlikely that a person with less than 6 years of experience has an income greater than 185 thousand per year, so we will drop these 17 customers (rows).
  2. For customers with more than 5 years of experience, we can consider an income greater than 185 thousand per year plausible, so we will treat it as real information (no outlier treatment).
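That drop is a boolean-mask filter; a sketch with illustrative rows (one row matches the implausible combination):

```python
import pandas as pd

# Illustrative rows: one customer has <6 years experience but >185k income
df = pd.DataFrame({"Experience": [2, 10, 4], "Income": [200, 210, 150]})

# Drop rows with <6 years of experience and income >185k, as decided above
mask = (df["Experience"] < 6) & (df["Income"] > 185)
df = df[~mask].reset_index(drop=True)
print(len(df))  # 2 rows remain
```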

Data Visualization: Univariate Analysis

Observation:

Observation:

Observation:

Observation:

Observation:

Observation:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Bivariate analysis

Correlation between numeric Variables

Observation:

Observation:

Observation

Observations:

Observation

Observation

Observation

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Data Pre-Processing

Dropping ID

Outliers detection using boxplot


Logistic Regression


Creating a function to split, encode and add a constant to X
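A sketch of a helper like the one described, assuming the usual stratified split and one-hot encoding; `statsmodels.api.add_constant` is replicated by inserting a column of ones so the sketch stays self-contained:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_encode(df: pd.DataFrame, target: str, test_size: float = 0.3):
    """Split into train/test, one-hot encode X, and add a constant column."""
    X = pd.get_dummies(df.drop(columns=[target]), drop_first=True)
    X.insert(0, "const", 1.0)  # same effect as sm.add_constant(X)
    y = df[target]
    return train_test_split(X, y, test_size=test_size,
                            random_state=1, stratify=y)

# Tiny illustrative frame
df = pd.DataFrame({"Income": [49, 34, 11, 100, 45, 29],
                   "north_south": ["North", "South"] * 3,
                   "Personal_Loan": [0, 0, 1, 1, 0, 1]})
X_train, X_test, y_train, y_test = split_encode(df, "Personal_Loan")
print(X_train.columns.tolist())
```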

Building the Logistic Regression model

Model evaluation criterion:

Insights:

How to reduce losses?

Logistic Regression (with Sklearn library)

Logistic Regression (with statsmodels library)

We will need to remove multicollinearity from the data to get reliable coefficients and p-values.

Multicollinearity

1. Removing Age

Age has the highest VIF; let's remove it and check the other values.
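The VIF check uses `variance_inflation_factor` from statsmodels. A sketch with synthetic data where Experience is (by construction) almost a linear function of Age, so both show a high VIF while an independent Income column does not:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic data: Experience tracks Age closely, Income is independent
rng = np.random.default_rng(0)
age = rng.uniform(23, 65, 200)
X = pd.DataFrame({"const": 1.0,
                  "Age": age,
                  "Experience": age - 22 + rng.normal(0, 0.5, 200),
                  "Income": rng.uniform(8, 200, 200)})

# VIF of each column against the rest
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns)
print(vif.round(1))
```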

Model with all the features except Age

Recall on training set : 0.6576 to 0.6545

Recall on test set : 0.6233 to 0.6164

2. Removing Experience

Experience has a high VIF if we keep Age. Let's remove it and check whether our model performs better without Experience.

Model with all the features except Experience

Recall on training set : AGE:0.6545 | Experience: 0.6545

Recall on test set : AGE:0.6164| Experience: 0.6164

Summary of the model

Metrics of final model 'lg6'

ROC-AUC

Coefficient interpretations

Converting coefficients to odds

Odds from coefficients

Percentage change in odds
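The conversion from a logistic-regression coefficient to odds and to a percentage change in odds is just an exponential; the coefficient value below is illustrative:

```python
import numpy as np

coef = 0.05                    # illustrative coefficient, e.g. for Income
odds = np.exp(coef)            # odds multiplier per unit increase
pct_change = (odds - 1) * 100  # percentage change in odds
print(round(odds, 4), round(pct_change, 2))  # ~1.0513, ~5.13%
```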

Coefficient interpretations

The interpretation for the other attributes is similar.

Model Performance Improvement

Optimal threshold using AUC-ROC curve

Let's use the Precision-Recall curve and see if we can find a better threshold.
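A common recipe for this step is to take the threshold where precision and recall are closest; a sketch with illustrative predicted probabilities:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Illustrative labels and predicted probabilities
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.45, 0.8, 0.15])

prec, rec, thr = precision_recall_curve(y_true, y_prob)

# Threshold where |precision - recall| is smallest
# (prec and rec have one more element than thr, so drop the last point)
best = thr[np.argmin(np.abs(prec[:-1] - rec[:-1]))]
print(best)
```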

Model Performance Summary

Conclusion


Decision Tree


Data Preparation

Split Data

Build Decision Tree Model

We only have 9.5% positive classes, so if our model marks every sample as negative it will still get 90.5% accuracy; hence accuracy is not a good metric to evaluate here.
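One common way to account for that imbalance when fitting the tree is `class_weight="balanced"`; a sketch on synthetic data with rare positives (the data and threshold below are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with roughly 10% positives
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(0, 0.2, 200) > 1.3).astype(int)

# class_weight="balanced" reweights classes inversely to their frequency
tree = DecisionTreeClassifier(class_weight="balanced", random_state=1)
tree.fit(X, y)

# An unpruned tree memorizes the training set; evaluate recall on a
# held-out set in practice, not training accuracy
print(tree.score(X, y))
```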

Insights:

How to reduce losses?

Visualizing the Decision Tree

Reducing Overfitting

Visualizing the Decision Tree

Cost Complexity Pruning

Recall peaks around alpha 0.010, but if we choose that value the decision tree will have only 3 decision nodes: we would lose the business rules, and our Type I error would be high (precision is low). Instead we can choose alpha 0.00135, retaining information while still getting a higher recall and good precision.
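The pruning sweep above can be sketched as follows: compute the cost-complexity path, refit one tree per alpha, and inspect how the tree shrinks (the data here is synthetic; the notebook's chosen value was ccp_alpha=0.00135):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(0, 0.5, 300) > 0).astype(int)

# Effective alphas at which subtrees get pruned
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas

# One tree per alpha; larger alpha -> smaller tree
trees = [DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X, y)
         for a in alphas]
print(trees[0].tree_.node_count, trees[-1].tree_.node_count)
```

In the notebook, recall and precision are then computed for each tree and the alpha balancing the two is kept.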

Visualizing the Decision Tree

Creating model with 0.00135 ccp_alpha

Observation:

Visualizing the Decision Tree

Comparing all the decision tree models

Conclusion

Perform an Exploratory Data Analysis on the incorrectly predicted data

As our target column (Personal Loan) is in the middle of the dataframe, we will drop it and append it at the end, so we can compare it with the predicted Personal Loan.

Observations:

Conclusions

• We analyzed "Personal Loan acceptance" using different techniques and used Logistic Regression and a Decision Tree Classifier to build predictive models for it.

• The built model can be used to predict whether a customer will accept a Personal_Loan and to create customer segments based on the significance of the independent variables.

• We visualized different trees and their confusion matrices to get a better understanding of the models. Easy interpretation is one of the key benefits of Decision Trees, followed by less data pre-processing (outliers, missing data, feature engineering, etc.). On the other hand, Logistic Regression requires all of this pre-processing, its results are harder to interpret correctly, and outliers affect the model. Logistic regression looks at the simultaneous effects of all the predictors, so it can perform much better with a small sample size.

• We verified how much less data preparation is needed for Decision Trees: such a simple model gave good results even with outliers and imbalanced classes, which shows the robustness of Decision Trees.

• The logistic regression model scored 0.8973 on the test set and the decision tree scored 0.9521; the dataset did better with the Decision Tree.

• Income, Family, CCAvg, Education, and Age are the most important variables in predicting which customers will accept the Personal Loan.

• We used post-pruning to reduce overfitting and chose alpha 0.00135 to get the best recall while keeping good precision. Recall is the most important metric, but we also want good precision to make sure we are targeting the correct customers.

Recommendations | Insight

• According to the Decision Tree model (post-tuning, bestmodel2):

a) If a customer's income is greater than 100k, there is a very high chance the customer will accept the Personal Loan.

b) If a customer has a high income and a family size greater than or equal to 3, there is a high chance this customer will accept a loan.